Non-Markovian Control with Gated End-to-End Memory Policy Networks

Authors

  • Julien Perez
  • Tomi Silander
Abstract

Partially observable environments present an important open challenge in the domain of sequential control learning with delayed rewards. Despite numerous attempts over the last two decades, the majority of reinforcement learning algorithms and associated approximate models applied in this context still assume Markovian state transitions. In this paper, we explore the use of a recently proposed attention-based model, the Gated End-to-End Memory Network, for sequential control. We call the resulting model the Gated End-to-End Memory Policy Network. More precisely, we use a model-free, value-based algorithm to learn policies for partially observed domains using this memory-enhanced neural network. This model is end-to-end learnable and features unbounded memory. Indeed, because of its attention mechanism and associated non-parametric memory, the proposed model, unlike recurrent models, allows us to define an attention mechanism over the observation stream. We show encouraging results that illustrate the capability of our attention-based model on the continuous-state, non-stationary control problem of stock trading. We also present an OpenAI Gym environment for a simulated stock exchange and explain its relevance as a benchmark for the field of non-Markovian decision process learning.
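As a rough illustration of attention over a stream of past observations, the sketch below implements a single gated memory hop in NumPy and reads value estimates off the result. It is a minimal sketch under stated assumptions, not the authors' implementation: the embedding matrices A, B and C, the gate parameters W_T and b_T, the output layer W_out, all dimensions, and the random inputs are hypothetical and untrained. The gate follows the commonly described gated memory-network update, u' = t * o + (1 - t) * u with t = sigmoid(W_T u + b_T).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory_hop(u, observations, A, C, W_T, b_T):
    """One gated attention hop over a history of observations.

    u            : (d,) controller state (embedded current observation)
    observations : (n, obs_dim) observation history; n may be any length
    A, C         : (d, obs_dim) input and output memory embeddings
    W_T, b_T     : (d, d) and (d,) gate parameters
    """
    m = observations @ A.T          # input memories, shape (n, d)
    c = observations @ C.T          # output memories, shape (n, d)
    p = softmax(m @ u)              # attention weights over the history
    o = p @ c                       # attention-weighted memory readout
    t = sigmoid(W_T @ u + b_T)      # gate deciding how much of o to admit
    return t * o + (1.0 - t) * u, p

# Hypothetical sizes and random, untrained parameters (assumptions).
rng = np.random.default_rng(0)
obs_dim, d, n_actions = 4, 8, 3
A = rng.normal(size=(d, obs_dim))
B = rng.normal(size=(d, obs_dim))   # embeds the current observation
C = rng.normal(size=(d, obs_dim))
W_T = rng.normal(size=(d, d))
b_T = np.zeros(d)
W_out = rng.normal(size=(n_actions, d))

history = rng.normal(size=(10, obs_dim))   # stand-in observation stream
u = B @ history[-1]
u, attention = gated_memory_hop(u, history, A, C, W_T, b_T)
q_values = W_out @ u                 # one value estimate per action
action = int(np.argmax(q_values))    # greedy choice (an assumption)
```

Because the memory is simply the array of past observations, the same function accepts a history of any length without changing the parameter count, which is one way to read the abstract's claim of unbounded, non-parametric memory.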

Similar resources

A Non-Preemptive Two-Class M/M/1 System with Prioritized Real-Time Jobs under Earliest-Deadline-First Policy

This paper introduces an analytical method for approximating the performance of a two-class priority M/M/1 system. The system is fully non-preemptive. More specifically, the prioritized class-1 jobs are real-time and served under the non-preemptive earliest-deadline-first (EDF) policy, but despite their priority they cannot preempt any non-real-time class-2 job. The waiting class-2 jobs can only be s...

A Multiprocessor System with Non-Preemptive Earliest-Deadline-First Scheduling Policy: A Performability Study

This paper introduces an analytical method for approximating the performability of a firm real-time system modeled by a multi-server queue. The service discipline in the queue is earliest-deadline-first (EDF), which is an optimal scheduling algorithm. Real-time jobs with exponentially distributed relative deadlines arrive according to a Poisson process. All jobs have deadlines until the end of s...

Synchronization criteria for T-S fuzzy singular complex dynamical networks with Markovian jumping parameters and mixed time-varying delays using pinning control

In this paper, we discuss the issue of synchronization for singular complex dynamical networks with Markovian jumping parameters and additive time-varying delays through pinning control, using Takagi-Sugeno (T-S) fuzzy theory. The complex dynamical systems consist of m nodes and the systems switch from one mode to another, a Markovian chain with glorious transition probabili...

Dipyridamole stress and rest gated 99mTc-sestamibi myocardial perfusion SPECT: left ventricular function indices and myocardial perfusion findings

Introduction: We investigated the difference in left ventricular ejection fraction (LVEF) and end-systolic volume (ESV) measured by gated myocardial perfusion SPECT (GSPECT) in the post-dipyridamole stress and rest periods, and compared the results with the perfusion patterns found in the conventional non-gated tomograms. Methods: 297 consecutive patients were studie...

Performance Modeling of Blocking Probability in Multihop Wireless Networks

Ad-hoc network communication requires efficient routing protocols to overcome the problems associated with unpredictable and dynamically changing topologies, which are mostly triggered by node mobility and the absence of base stations and central controllers. We propose a performance evaluation model for blocking probability of multihop calls in ideal macrocell environment conditions using t...

Journal:
  • CoRR

Volume abs/1705.10993

Publication date: 2017